Exploratory Data Analysis

Ted Laderas

2019-04-08

Our Overall Goal

What is NHANES and why are we looking at it?

Please Note

NHANES is a valuable dataset in many ways

VariableName Definition
Age Age in years at screening of study participant. Note: Subjects 80 years or older were recorded as 80.
AlcoholDay Average number of drinks consumed on days that participant drank alcoholic beverages. Reported for participants aged 18 years or older.
BMI Body mass index (weight/height^2 in kg/m^2). Reported for participants aged 2 years or older.
BMI_WHO Body mass index category. Reported for participants aged 2 years or older. One of 12.0_18.4, 18.5_24.9, 25.0_29.9, or 30.0_plus.
BPSysAve Combined systolic blood pressure reading, following the procedure outlined for BPXSAR.
Depressed Self-reported number of days where participant felt down, depressed or hopeless. Reported for participants aged 18 years or older. One of None, Several, Majority (more than half the days), or AlmostAll.
Education Educational level of study participant. Reported for participants aged 20 years or older. One of 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
Gender Gender (sex) of study participant, coded as male or female.
HardDrugs Participant has tried cocaine, crack cocaine, heroin or methamphetamine. Reported for participants aged 18 to 69 years as Yes or No.
HHIncome Total annual gross income for the household in US dollars. One of 0 - 4999, 5000 - 9999, 10000 - 14999, 15000 - 19999, 20000 - 24999, 25000 - 34999, 35000 - 44999, 45000 - 54999, 55000 - 64999, 65000 - 74999, 75000 - 99999, or 100000 or More.
LittleInterest Self-reported number of days where participant had little interest in doing things. Reported for participants aged 18 years or older. One of None, Several, Majority (more than half the days), or AlmostAll.
Marijuana Participant has tried marijuana. Reported for participants aged 18 to 59 years as Yes or No.
MaritalStatus Marital status of study participant. Reported for participants aged 20 years or older. One of Married, Widowed, Divorced, Separated, NeverMarried, or LivePartner (living with partner).
Race1 Reported race of study participant: Mexican, Hispanic, White, Black, or Other.
Race3 Reported race of study participant, including non-Hispanic Asian category: Mexican, Hispanic, White, Black, Asian, or Other. Not available for 2009-10.
RegularMarij Participant has been/is a regular marijuana user (used at least once a month for a year). Reported for participants aged 18 to 59 years as Yes or No.
AgeRegMarij Age of participant when first started regularly using marijuana. Reported for participants aged 18 to 59 years.
SleepHrsNight Self-reported number of hours study participant usually gets at night on weekdays or workdays. Reported for participants aged 16 years and older.
SleepTrouble Participant has told a doctor or other health professional that they had trouble sleeping. Reported for participants aged 16 years and older. Coded as Yes or No.
SurveyYr Which survey the participant participated in.
TotChol Total HDL cholesterol in mmol/L. Reported for participants aged 6 years or older.
TVHrsDay Number of hours per day on average participant watched TV over the past 30 days. Reported for participants 2 years or older. One of 0_to_1hr, 1_hr, 2_hr, 3_hr, 4_hr, More_4_hr. Not available 2009-2010.

Outcomes

We can understand an outcome and look at its association with measured variables in the data.

We’ll look at Depression today, but Physical Activity and Diabetes Status are also available as outcomes.

Before we Start

Take a Look at the Data as a Sheet

NHANES Extract in Google Sheet Form

What is Exploratory Data Analysis?

Remember

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” - John Tukey, Exploratory Data Analysis

EDA is about Visualization First

Running the Explorer App

We’ll start exploring the data immediately!

Go to the app:

Map your questions to a tab:

What is the Overview Tab for?

Data Explorer

Overview Tab

Let’s try it!

  1. How many categories are there for the Depressed variable? (In R, a categorical variable is called a factor and its categories are called levels.)
  2. How many missing cases are there for Depressed?
  3. What is the mean age for the dataset?
  4. How is the Depressed variable defined in this dataset?
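The first three questions can also be answered by hand. The following is a minimal sketch in Python using only the standard library; the records are made up for illustration and are not NHANES values, and the helper name `summarize_categorical` is hypothetical. Note that the string "None" is a real category of Depressed (zero days), while Python's `None` marks a missing answer.

```python
from collections import Counter
from statistics import mean

# Made-up toy records (NOT real NHANES rows). Python None = missing value;
# the string "None" is the Depressed category meaning "no days".
records = [
    {"Age": 34, "Depressed": "None"},
    {"Age": 52, "Depressed": "Several"},
    {"Age": 80, "Depressed": "None"},
    {"Age": 12, "Depressed": None},
    {"Age": 45, "Depressed": "Majority"},
    {"Age": 27, "Depressed": None},
]

def summarize_categorical(rows, name):
    """Count each observed category and the number of missing values."""
    values = [row[name] for row in rows]
    observed = [v for v in values if v is not None]
    return {
        "levels": Counter(observed),
        "n_missing": len(values) - len(observed),
    }

summary = summarize_categorical(records, "Depressed")
n_levels = len(summary["levels"])      # distinct categories that appear
n_missing = summary["n_missing"]       # rows with no answer
mean_age = mean(row["Age"] for row in records)

print(n_levels, n_missing, round(mean_age, 1))
```

Only categories that actually occur are counted here, so a level with zero observations would not show up; the Overview Tab is the easier place to see the full definition.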

What is the Categorical Tab for?

Categorical Tab

Categorical Example

Do people with the most days of LittleInterest also have the most days of Depression?

Categories: Let’s try it!

  1. What is the category with the largest counts for Depressed?
  2. Do the proportions of people with your outcome look the same for those who use marijuana versus those who don’t use it?
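Comparing proportions between groups amounts to a cross-tabulation in which each row is divided by its own total. The sketch below shows the idea in plain Python on made-up (Marijuana, Depressed) pairs; the data and the function name `row_proportions` are assumptions for illustration, and pairs with a missing value are simply dropped.

```python
from collections import Counter, defaultdict

# Made-up (Marijuana, Depressed) pairs; None marks a missing value.
pairs = [
    ("Yes", "None"), ("Yes", "Several"), ("Yes", "None"), ("Yes", "Majority"),
    ("No", "None"), ("No", "None"), ("No", "Several"), ("No", "None"),
    (None, "None"), ("No", None),
]

def row_proportions(pairs):
    """Proportion of each outcome within each group, skipping incomplete pairs."""
    counts = defaultdict(Counter)
    for group, outcome in pairs:
        if group is None or outcome is None:
            continue  # complete cases only
        counts[group][outcome] += 1
    return {
        group: {outcome: n / sum(c.values()) for outcome, n in c.items()}
        for group, c in counts.items()
    }

props = row_proportions(pairs)
for group, dist in props.items():
    print(group, dist)
```

Dividing within each group matters because the groups usually differ in size, so raw counts alone can make a pattern look stronger or weaker than it is.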

Why is the Data Missing?

There are many reasons why data can be missing for a variable!
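One reason is visible in the variable definitions: Depressed is only reported for participants aged 18 years or older, so younger participants are missing by design. A rough way to check for this kind of pattern is to compare the missing rate across groups. The sketch below does this in plain Python on made-up records; the function name `missing_rate_by_group` and the data are hypothetical.

```python
from collections import Counter

# Made-up records (NOT real NHANES rows); None marks a missing value.
records = [
    {"Age": 9, "Depressed": None},
    {"Age": 15, "Depressed": None},
    {"Age": 17, "Depressed": None},
    {"Age": 22, "Depressed": "None"},
    {"Age": 35, "Depressed": "None"},
    {"Age": 40, "Depressed": None},      # missing for some other reason
    {"Age": 63, "Depressed": "Several"},
]

def missing_rate_by_group(rows, name, group_of):
    """Fraction of rows with a missing `name`, computed within each group."""
    totals, missing = Counter(), Counter()
    for row in rows:
        group = group_of(row)
        totals[group] += 1
        if row[name] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

def age_group(row):
    return "under 18" if row["Age"] < 18 else "18 or older"

rates = missing_rate_by_group(records, "Depressed", age_group)
print(rates)
```

A missing rate of 100% in one group and a much lower rate in another suggests the values are missing because of how the survey was designed, not because people declined to answer.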

Things to Consider about Missing Data

NA: When Missing Data is Valuable

Assessing Missing Data: naniar

What do you do if the data is missing?

Depends on what you want to do:

Continuous Tab

Continuous Scatter

If you get fewer hours of sleep per night, does that mean you have a higher BMI?
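A scatter plot shows the shape of a relationship; a correlation coefficient is one way to put a number on it. The sketch below computes a Pearson correlation by hand on made-up sleep and BMI values, dropping pairs with a missing value. The numbers and the function name `pearson_r` are illustrative assumptions, and a correlation on its own says nothing about whether one variable causes the other.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Made-up values (NOT real NHANES data); None marks a missing value.
sleep_hours = [5, 6, None, 7, 8, 9]
bmi = [31.0, 29.0, 27.0, 27.5, None, 25.5]

r = pearson_r(sleep_hours, bmi)
print(round(r, 2))
```

A value near -1 or 1 indicates a strong linear pattern and a value near 0 indicates little linear pattern, which is why it is worth looking at the scatter plot first: a curved relationship can produce a small r.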

Continuous Boxplot

If you have a lot of depressed episodes, do you also get less sleep?
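A boxplot draws the median and quartiles of a continuous variable for each category. The sketch below computes those same summaries in plain Python for made-up SleepHrsNight values grouped by Depressed. The data and the function name `box_summary` are assumptions for illustration, and the "inclusive" quartile method used here is one of several conventions, so plotted boxes may differ slightly.

```python
from collections import defaultdict
from statistics import median, quantiles

# Made-up (Depressed, SleepHrsNight) pairs; NOT real NHANES data.
rows = [
    ("None", 7), ("None", 8), ("None", 8), ("None", 7), ("None", 9),
    ("Majority", 5), ("Majority", 6), ("Majority", 6), ("Majority", 7),
    ("Several", None),  # missing sleep value, skipped below
]

def box_summary(rows):
    """Median and quartiles of the numeric value within each category."""
    groups = defaultdict(list)
    for category, value in rows:
        if category is not None and value is not None:
            groups[category].append(value)
    summary = {}
    for category, values in groups.items():
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[category] = {"q1": q1, "median": median(values), "q3": q3}
    return summary

boxes = box_summary(rows)
for category, stats in boxes.items():
    print(category, stats)
```

Comparing medians across categories answers the question roughly; the quartiles show how much the groups overlap.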

Continuous: Let’s Try it!

Depression Questions

  1. Is there an association of LittleInterest with Depressed?
  2. Is marijuana use associated with depression?
  3. Is Hard Drug use associated with depression?
  4. How are sleep and marijuana use related in the dataset?
  5. Is there a relationship between Sleep hours and depression?
  6. Is there a relationship between Sleep hours and Age?
  7. Is there a relationship between hours of TV watched and Depression?
  8. Or, choose a question! It should look at at least two variables.

Let’s learn from each other

Each group should present the findings from 1 interesting question:

  1. Where did you find it in the app?
  2. What variables did you look at? How were they defined?
  3. What did you expect in terms of the relationship?
  4. Pick another variable and assess its association with your non-Depressed variable.
  5. What did you find?

Some Final Notes about NHANES

Congratulations

You are now a full-fledged data explorer!

https://waynepelletier.com/work/tasty-icons

Burro

R package that lets you explore your data:

http://laderast.github.io/burro

Are people interested in an optional session?

What is Data Science?

What are Data Science Skills?

Statistics is only helpful if

You’re convinced that the effect you’re interested in is real.